Initial Loading

?library
## starting httpd help server ... done
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

ggdist Loading

library(ggdist)

Week 1 Assignment

To figure out how to load a library in RStudio, I used a quick Google search. I entered “how to load libraries in R” into the searched bar and opened the first two links, both of which explained to use the “library()” function. Running an additional code block using the help function certified the contents of those link. After adding it into the code block and running it, I was able to see that the library was successfully loaded into RStudio. I reset R and cleared the outputs and ran it once more to be sure.

https://bookdown.org/nana/intror/install-and-load-packages.html https://campus.datacamp.com/courses/intermediate-r/chapter-3-functions?ex=17

?tidyverse

Description

The ‘tidyverse’ is a set of packages that work in harmony because they share common data representations and ‘API’ design. This package is designed to make it easy to install and load multiple ‘tidyverse’ packages in a single step. Learn more about the ‘tidyverse’ at https://www.tidyverse.org.

Week 2 Assignment

MFID_analogy_read <- read_csv("/Users/vince/OneDrive/Documents/GitHub/Data2SciComm/WeeklyAssignments/tidy_data/MFIndD_analogy.csv")
## Rows: 792 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): qualtrics_id, match_type_0, match_type_1, response_type
## dbl (2): trial_number, response
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
MFID_analogy_read
  1. The column that contains the unique identifier for each participant is the “qualtrics_id” column.

2a. There are 792 rows and 6 columns in the dataset. I did not use a specific code chunk for this information because the number of rows and columns are provided when I initially read in the csv file for number 1.

MFID_analogy_read %>%
  distinct(qualtrics_id) %>%
  count()

2b. Using the distinct function, I was able to filter out repeated qualtrics ID’s for multiple trials of the same respondee. Therefore, there are 99 individuals in the dataset, as seen in the code chunk above.

MFID_analogy_read %>%
  group_by(qualtrics_id) %>%
  summarize(n_trials = n())

2c. As seen in the code chunk above, each of the 99 participants complete 8 trials, which was found using the count function n(). Therefore, every participant has the same number of trials.

Week 3 Assignment

MFID_summary <- MFID_analogy_read %>%
  group_by(qualtrics_id) %>%
  summarize(relational_matches = sum(response_type == "Rel"))

MFID_summary #make it show up
MFID_summary %>%
  ggplot(aes(x = relational_matches)) +
  geom_histogram(binwidth = 1, fill = "gray", color = "black") +
  labs(title = "Distribution of Relational Matches", 
       x = "Number of Relational Matches", 
       y = "Number of Participants")

  1. The histogram shows that participants generally tend to choose either relational matches or object matches across all trials. This is derived from the fact that an overwhelming majority of participants are at both extremes of the histogram.
MFID_analogy_read %>%
  group_by(qualtrics_id) %>%
  select(-match_type_0, -match_type_1, -response) %>% #taking additional info out
    pivot_wider(names_from = trial_number,
                values_from = response_type,
                names_prefix = "trial ") %>%
      select("trial 1", "trial 2", "trial 3", "trial 4", "trial 5", "trial 6", "trial 7", "trial 8") #reordering to make it look better
## Adding missing grouping variables: `qualtrics_id`
MFID_analogy_read %>%
  group_by(qualtrics_id) %>%
    pivot_wider(names_from = trial_number,
                values_from = response_type) %>% 
      select(-match_type_0, -match_type_1, -response) %>% #remove these columns
      select(1, 2, 3, 4, 5, 6, 7, 8)

Show My Work

This code chuck did work, just not in the way that I intended it to. After removing the unnecessary columns, I wanted to reorder the columns so that it would start at trial 1 and increment to trial 8. In the corrected code chunk, I added a prefix of “trials” so that I would be more specific. I then realized that I am still a little unfamiliar with the syntax for certain functions, thus forgetting to put the column names in quotations.

Week 4 Assignment

MFID_REI_read <- read_csv("/Users/vince/OneDrive/Documents/GitHub/Data2SciComm/WeeklyAssignments/tidy_data/MFIndD_REI.csv")
## Rows: 4000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): qualtrics_id, sub_type, rev_scoring, response
## dbl (2): item_number, scored_response
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(MFID_REI_read)
MFID_REI_read
  1. It is likely that the columns for “response” and “scored_response” have different column types because they contain different data. The column type for response is because it is likely that the response can be more than just a number, resulting in the whole column being treated as a non-numeric character. It is likely that the column type for response is double because the column only contains numeric values.
MFID_REI_read_fixed <- MFID_REI_read %>%
  mutate(response = case_when(response == 2 ~ 2,
                              response == 3 ~ 3,
                              response == 4 ~ 4,
                              response == "Strongly Agree" ~ 5,
                              response == "Strongly Disagree" ~ 1))
MFID_REI_read_fixed
MFID_REI_read_mut <- MFID_REI_read_fixed %>% 
  mutate(new_scored_response = case_when(is.na(rev_scoring) == TRUE ~ response ,
                                          rev_scoring == "neg" ~ 6 - response))

MFID_REI_read_mut
MFID_REI_read_mut %>%
  mutate(resp_check = case_when(new_scored_response == scored_response ~ TRUE, 
                                new_scored_response != scored_response ~ FALSE)
         ) %>% 
  distinct(resp_check)

Week 5 Assignment

MFID_REI_read_mut %>% distinct(sub_type) 
REI_summary <- MFID_REI_read_mut %>% 
  group_by(qualtrics_id, sub_type) %>%
    summarise(total_score = sum(new_scored_response)) 
## `summarise()` has grouped output by 'qualtrics_id'. You can override using the
## `.groups` argument.
REI_summary 
na <- REI_summary %>% 
  filter(is.na(total_score))

na

1b. Yes, there are scores that are NA.

REI_sum_notna <- MFID_REI_read_mut %>%
  group_by(qualtrics_id, sub_type) %>%
  summarise(total_score = sum(new_scored_response, na.rm = TRUE))
## `summarise()` has grouped output by 'qualtrics_id'. You can override using the
## `.groups` argument.
REI_sum_notna
na2 <- REI_sum_notna %>%
  filter(is.na(total_score))

na2
MFID_combined <- REI_sum_notna %>%
  inner_join(MFID_summary)
## Joining with `by = join_by(qualtrics_id)`
MFID_combined
ggplot(MFID_combined, aes(x = total_score, 
                          y = relational_matches,
                          color = sub_type)) + 
  geom_point() +
  geom_smooth() +
  labs(title = "Relation Between REI Score and Analogy Score", 
       x = "REI Total Score",
       y = "Analogy Score") + 
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

  1. There is some variability in the different best fit lines per sub-type. However,it seems that there is little to no correlation between the total scoring of the REI sub-types and the number of relational matches made.

Week 6 Assignment

  1. loading data file
MFID_prob <- MFID_analogy_read <- read_csv("/Users/vince/OneDrive/Documents/GitHub/Data2SciComm/WeeklyAssignments/tidy_data/MFIndD_probtask.csv")
## Rows: 26136 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): SubID, condition, discreteness, regularity, integration, major_col...
## dbl  (6): rt_toolong, rt_tooshort, rt_exclu, trial_index, block, rt
## lgl  (1): correct
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
MFID_prob

2a. create a vector

conditions <- MFID_prob %>%
  distinct(condition) %>%
  .$condition

conditions # vector to store conditions
## [1] "dots_EqSizeSep"  "dots_EqSizeRand" "blob_shifted"    "blob_stacked"

2b. calculate mean

mean_rt <- numeric(length(conditions))

#filter invalid trials
valid_trials <- MFID_prob %>%
  filter(rt_exclu == 0)

for (i in seq_along(conditions)) {
  
  condition_data <- valid_trials %>%
    filter(condition == conditions[i])
  
  mean_rt[i] <- mean(condition_data$rt, na.rm = TRUE) #put the calculated mean in the mean_rt vector 
  
}

mean_rt
## [1] 889.8202 879.6184 915.1974 866.0546

logical explanation:

create a vector for the mean reaction times. in order to find the mean of each condition, i first need to filter out all of the valid trials (where rt_exclu == 0). then i can loop per condition and check where the condition in the valid trials matches one of the conditions in our conditions vector. after that i need to calculate the mean for all of the reactions of that specific condition and put that in our mean reaction time vector.

summary_across <- valid_trials %>% 
  group_by(condition) %>% 
  summarize(across(c(correct, rt))) %>%
  summarize(mean_rt = mean(rt, na.rm = TRUE), # find the mean
            overall_accuracy = mean(correct, na.rm = TRUE)) # find the accuracy
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'condition'. You can override using the
## `.groups` argument.
summary_across
summary_without <- valid_trials %>%
  group_by(condition) %>%
  summarize(mean_rt = mean(rt, na.rm = TRUE), # find the mean
            overall_accuracy = mean(correct, na.rm = TRUE)) #find the accuracy

summary_without 

Week 7 Assignment

MFID_prob
MFID_prob %>%
  group_by(condition) %>%
    summarise(across(c(rt, correct), mean)) %>%
      pivot_longer(c(rt, correct)) %>%
        ggplot(aes(y = value, x = condition)) +
        geom_point(color = "red") +
        facet_wrap(~name, scales = "free")

1a. From the plot, I assume that a higher proportion of correct scores results from a better reaction time or vice versa. Blob_stacked and dots_EqSizeRand have higher accuracy and a faster reaction time, while blob_shifted and dots_EqSizeSep are lower in accuracy with a slower reaction time.

1b. The first thing that I noticed were the values in the axes of the graphs. Typically, graphs with great variability tend to have large ranges that contribute to conclusions about relationships. For example, I expected to see something along the lines of a higher accuracy results in a slower reaction time. More specifically, the blob_stacked column stood out as it had the highest accuracy but the fastest reaction time.

1c. Some information that I find difficult to see is a clear relationship between the accuracy of the conditions and the mean reaction time. The conclusion I came to in 1a is a pure assumption from looking at it, but I feel as though there are ways to visualize the information better. Further descriptive information of the graph and its axes could also be helpful. There is no title to describe what data we are looking at, and the y-axis in particular lacks information about what the values for each graph are.

2a. summarize the data set

sum_prob <- MFID_prob %>%
  group_by(SubID, condition) %>%
    summarize(prop_corr = mean(correct))
## `summarise()` has grouped output by 'SubID'. You can override using the
## `.groups` argument.
sum_prob

2b. make the plot

sum_prob %>%
  group_by(condition) %>%
  summarize(mean_prop_corr = mean(prop_corr, na.rm = TRUE)) %>%
  
  ggplot(aes(y = mean_prop_corr, 
             x = condition),
             scale = 0:1) +
  
  stat_summary(fun = mean, geom ="point", color = "red") +
  
  theme_minimal() + 
  
  labs(x = "Condition", 
       y = "Mean Proportion of Correct",
       title = "Mean Accuracy of Responses by Condition on Probability Task") + 
  
  coord_cartesian(ylim = c(0:1))

2c. To make the graph less misleading, I added a descriptive title and y-axis so the reader knows the graph that they are looking at and what information is being displayed on it. I also changed the scale so that it better shows that the y-axis is a proportion.

3a. add distributional info

sum_prob %>%
  
  ggplot(aes(y = prop_corr, 
             x = condition)) +
  
  stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
  
  geom_dotsinterval(dotsize = 0.5,
                    slab_color = "lightblue",
                    slab_alpha = 0.6) +
  
  theme_minimal() + 
  
  labs(x = "Condition", 
       y = "Mean Proportion of Correct",
       title = "Mean Accuracy of Responses by Condition on Probability Task") +
  
  ylim(0,1)

3b. What does this graph tell us: From my understanding, the dots represent individual cases of proportion correct for the different SubID’s. Thus, it allows us to see the variability across the data set and where responses were more concentrated. Overall, it gives us more information about why the mean proportion of correct for each condition is the way it is.

Aesthetics: As seen above, I changed the scale of the y-axis, added a descriptive title and y-axis title, set the theme to minimal. I made the actual mean points larger so that they would stand out more in comparison to the dots. For the geom_dotsintervals, I changed the color to light blue, reduced their size, and lessened their opacity so it didn’t overtake the importance of the mean.

Week 8 Assignment

wrangled <- MFID_prob %>% 
  group_by(SubID, condition) %>%
  summarize(rt = mean(rt, na.rm = TRUE),
            correct = mean(correct, na.rm = TRUE))
## `summarise()` has grouped output by 'SubID'. You can override using the
## `.groups` argument.
wrangled

3a. colors

wrangled %>% 
  
  ggplot(aes(x = rt, 
             y = correct,
             color = condition)) +
  
  geom_point(size = 3,
             alpha = 0.6) +
  
  theme_minimal() +
  
  labs(title = "Relationship Between Reaction Time and Accuracy Separated by Condition",
       x = "Average Reaction Time",
       y = "Proportion Correct",
       color = "Condition") 

3b. facets in separate plots

wrangled %>% 
  
  ggplot(aes(x = rt, 
             y = correct)) +
  
  geom_point(size = 3,
             alpha = 0.6) +
  
  theme_minimal() +
  
  labs(title = "Relationship Between Reaction Time and Accuracy Separated by Condition",
       x = "Average Reaction Time",
       y = "Proportion Correct",
       color = "Condition") +
  
  facet_wrap(~condition, scales = "free")

  1. Primary Observation: My primary observation from these graphs is that there seems to be a positive correlation between reaction time and the proportion correct. When looking at each graph individually, it seems that dots_EqSizeSep and blob_shifted tend to cluster towards lower reaction and low accuracy, dots_EqSizeRand tends to have an average reaction time and accuracy, and blob_stacked cluster towards a higher reaction time and higher accuracy.